Efficient sparse retrieval through embedding-based inverted index construction
Abstract
Modern search engines use a two-stage architecture to search large volumes of data efficiently and with high quality. The first stage applies simple, fast algorithms such as BM25; the second stage applies more precise but resource-intensive methods, such as deep neural networks. Although this approach yields good results, its quality is fundamentally limited by the vocabulary mismatch problem inherent in the simple first-stage algorithms. To address this issue, we propose an algorithm that constructs an inverted index from vector representations, combining the advantages of both stages: the efficiency of the inverted index and the high search quality of vector models. We build a vector index that preserves the different semantic meanings of vocabulary tokens. For each token, we identify the documents in which it occurs and cluster its contextualized embeddings. The centroids of the resulting clusters represent the different semantic meanings of the token. This process yields an extended vocabulary, which is used to build the inverted index. During index construction, similarity scores between each semantic meaning of a token and the documents are computed and stored for use during search. This reduces the number of similarity computations required at query time. Searching the inverted index first requires finding keys in the vector index, which helps to solve the vocabulary mismatch problem. The algorithm is demonstrated on a search task on the SciFact dataset. The proposed method achieves high search quality with low memory requirements, maintaining a compact vector index whose size remains constant and depends only on the size of the vocabulary.
The main drawback of the algorithm is that a deep neural network is needed to generate vector representations of queries during search, which slows down this stage. Finding ways to address this issue and accelerate the search process is a direction for future research.
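The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the hand-written three-dimensional vectors stand in for contextualized embeddings from a neural encoder, the small cosine k-means stands in for whatever clustering method the paper uses, and the scoring rules (maximum similarity per document in a posting, key similarity used as a weight at query time) are hypothetical choices. The names `kmeans`, `build_index`, and `search` are likewise invented here.

```python
import math
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(points, k, iters=10):
    """Tiny deterministic cosine k-means with farthest-point seeding (assumed method)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point least similar to all seeds chosen so far.
        centroids.append(min(points, key=lambda p: max(dot(c, p) for c in centroids)))
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            best = max(range(len(centroids)), key=lambda i: dot(centroids[i], p))
            groups[best].append(p)
        centroids = [normalize(mean(g)) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids

def build_index(occurrences, senses_per_token=2):
    """occurrences: (doc_id, token, contextual embedding) triples."""
    by_token = defaultdict(list)
    for doc_id, token, emb in occurrences:
        by_token[token].append((doc_id, normalize(emb)))
    sense_vectors = {}             # vector index: (token, sense id) -> centroid
    inverted = defaultdict(dict)   # inverted index: key -> {doc_id: stored score}
    for token, items in by_token.items():
        points = [e for _, e in items]
        centroids = kmeans(points, min(senses_per_token, len(points)))
        for i, c in enumerate(centroids):
            sense_vectors[(token, i)] = c
        for doc_id, e in items:
            i = max(range(len(centroids)), key=lambda j: dot(centroids[j], e))
            postings = inverted[(token, i)]
            # Assumed scoring: best similarity between the sense and the document.
            postings[doc_id] = max(dot(centroids[i], e), postings.get(doc_id, 0.0))
    return sense_vectors, inverted

def search(query_embeddings, sense_vectors, inverted, keys_per_token=2):
    doc_scores = defaultdict(float)
    for q in query_embeddings:
        q = normalize(q)
        # Key lookup runs over the whole extended vocabulary, not exact tokens.
        nearest = sorted(sense_vectors, key=lambda k: dot(sense_vectors[k], q),
                         reverse=True)[:keys_per_token]
        for key in nearest:
            weight = dot(sense_vectors[key], q)
            for doc_id, score in inverted[key].items():
                doc_scores[doc_id] += weight * score  # precomputed score reused
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: "bank" appears in a river sense (d1) and a finance sense (d2, d3).
occurrences = [
    ("d1", "bank", [0.9, 0.1, 0.0]),
    ("d2", "bank", [0.1, 0.9, 0.0]),
    ("d3", "bank", [0.2, 0.8, 0.1]),
    ("d4", "shore", [0.8, 0.2, 0.1]),
]
sense_vectors, inverted = build_index(occurrences)
# A query token that is absent from the vocabulary but river-like in meaning.
ranked = search([[1.0, 0.0, 0.1]], sense_vectors, inverted)
print(ranked)
```

In this toy run the query token matches no vocabulary entry literally, yet the nearest keys are the river sense of "bank" and the single sense of "shore", so only d1 and d4 are returned and the finance documents are never scored. The only per-query vector work is the key lookup over the fixed-size sense table; document scores come from the stored postings. Producing the query embedding itself would still require the encoder, which is the drawback noted above.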
Keywords
Permanent URL
Articles in current issue
- Multispectral optoelectronic system
- Study of the influence of laser wavelength on the dichroism effect in ZnO:Ag films
- Direct laser thermochemical writing on titanium films for rasterized images creation
- Algorithms of direct output-feedback adaptive control of a linear system with finite time tuning
- Large language models in information security and penetration testing: a systematic review of application possibilities
- Usage of polar codes for fixed and random length error bursts correction
- Method of semantic segmentation of airborne laser scanning data of water protection zones
- Directional variance-based algorithm for digital image smoothing
- DAS signal modeling using the generative adversarial neural network technique
- Multidimensional trajectory planning algorithm for a 5D printer slicer
- Scheduling distributed computations in non-deterministic systems
- Enhancing and extending CatBoost for accurate detection and classification of DoS and DDoS attack subtypes in network traffic
- Detection of L0-optimized attacks via anomaly scores distribution analysis
- Numerical study of SiO2 particle erosion of an aluminum alloy
- An approach to solving the problem of geomagnetic data scarcity in decision-making support
- Construction of matched distance function for simple Markov channel
- Application of the dynamic regressor extension and mixing approach in machine learning on the example of perceptron
- WaveVRF: post-quantum verifiable random function based on error-correcting codes